This report explores quality and other attributes of 1599 red wine samples described by 14 variables.
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality      qualityFactor
##  Min.   : 8.40   Min.   :3.000   3: 10
##  1st Qu.: 9.50   1st Qu.:5.000   4: 53
##  Median :10.20   Median :6.000   5:681
##  Mean   :10.42   Mean   :5.636   6:638
##  3rd Qu.:11.10   3rd Qu.:6.000   7:199
##  Max.   :14.90   Max.   :8.000   8: 18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution appears right skewed (the mean, 10.42, exceeds the median, 10.20), with alcohol percent peaking at approximately 9.5 and 75% of the wines having alcohol below 11.1.
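A rough check of skew direction can be made from the quartiles alone. The sketch below uses Bowley's quartile skewness, which is positive when the upper half of the data is more spread out than the lower half (a longer right tail). The helper name `bowley_skew` is made up for illustration, and the quartiles are copied from the summary output above rather than recomputed from the raw data.

```r
# Bowley (quartile) skewness: > 0 suggests a longer right tail,
# < 0 a longer left tail. Hypothetical helper, not part of the original analysis.
bowley_skew <- function(q1, med, q3) {
  (q3 + q1 - 2 * med) / (q3 - q1)
}

# Quartiles copied from the summary() output in this report.
quartiles <- data.frame(
  variable = c("alcohol", "residual.sugar", "density", "fixed.acidity"),
  q1       = c(9.50, 1.900, 0.9956, 7.10),
  med      = c(10.20, 2.200, 0.9968, 7.90),
  q3       = c(11.10, 2.600, 0.9978, 9.20)
)

quartiles$skew <- with(quartiles, bowley_skew(q1, med, q3))
print(quartiles)
```

This measure ignores everything outside the middle 50% of the data, so it only indicates direction; a histogram remains the better guide to the shape of the tails.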
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Most of the wines have residual sugar below 4 grams/liter, but a few wines have much higher values (up to 15.5). I would like to see how these wines are rated. Are sweeter wines rated higher? Is sweetness a criterion in rating a wine?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density is approximately normally distributed, with mean 0.9967 and median 0.9968.
Total sulfur dioxide represents the amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 mg/L, SO2 becomes evident in the nose and taste of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
As can be seen above, the majority of the wines have a low sulfur dioxide concentration (75% are at or below 62), possibly because SO2 in high concentration can be detected in the taste of wine.
Free sulfur dioxide prevents microbial growth and oxidation of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
I would like to explore how the amount of free sulfur dioxide impacts the quality of wine, since it protects wine against oxidation and microbial growth. The above
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Most of the wines are of quality 5 or 6, and quality ranges from 3 to 8.
pH tells how acidic a wine is. The scale ranges from 0 (very acidic) to 14 (very basic); 7 is neutral.
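The concentration in the middle ratings can be quantified from the counts alone. The following is a small sketch that rebuilds the quality vector from the `qualityFactor` counts printed in the summary at the top of the report; it does not read the original data file.

```r
# Counts per quality level, copied from the qualityFactor column of the summary.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild the numeric quality vector from those counts.
quality <- rep(as.integer(names(counts)), times = counts)

# Percentage of wines at each quality level.
share <- round(100 * prop.table(counts), 1)
print(share)

# Fraction of wines rated 5 or 6.
mid_share <- sum(counts[c("5", "6")]) / sum(counts)
cat("Share rated 5 or 6:", round(100 * mid_share, 1), "%\n")
cat("Mean quality:", round(mean(quality), 3), "\n")
```

The rebuilt vector reproduces the reported mean of 5.636, which is a useful sanity check that the counts and the numeric summary describe the same data.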
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Looking at the above graph, all the wines are acidic, with pH between 2.74 and 4.01.
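Because pH is a logarithmic scale (pH is the negative base-10 logarithm of the hydrogen ion concentration), a seemingly narrow pH range hides a large difference in acidity. The sketch below illustrates this using only the minimum, median and maximum pH from the summary; `h_conc` is a hypothetical helper name.

```r
# [H+] in mol/L from pH, using pH = -log10([H+]).
h_conc <- function(pH) 10^(-pH)

# Values copied from the pH summary in this report.
ph_min    <- 2.740
ph_median <- 3.310
ph_max    <- 4.010

# How many times more hydrogen ions the most acidic wine has than the least acidic.
ratio_extremes <- h_conc(ph_min) / h_conc(ph_max)

# How many times more hydrogen ions the median wine has than a neutral solution.
ratio_vs_neutral <- h_conc(ph_median) / h_conc(7)

cat("Most vs least acidic wine:", round(ratio_extremes, 1), "times\n")
cat("Median wine vs neutral:   ", round(ratio_vs_neutral), "times\n")
```

So the span from pH 2.74 to 4.01 corresponds to more than an order of magnitude in hydrogen ion concentration.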
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Most of the wines have volatile acidity below 1 gram/liter, which is on the lower side. Higher values of volatile acidity can give wine a vinegar-like taste. I am curious to find out whether volatile acidity impacts the quality rating of the wine; I believe it should. There also appear to be some outliers on the right side of the graph.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid adds freshness to the wine. Most of the wines have citric acid below 0.75. I am curious to see whether citric acid impacts the quality rating.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity has a roughly bell-shaped distribution with median 7.9 and mean 8.32; the mean lying above the median suggests a slight right skew.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Most wines have sulphates below 1. Since sulphates contribute to SO2 levels, which help against oxidation and microbial growth, I am curious to see how sulphate concentration impacts the quality rating of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Most wines have chlorides well under 0.4 (the third quartile is 0.09), with a few outliers up to 0.611.
The dataset has 14 variables and 1599 observations. All variables are numeric except qualityFactor, which is a factor created from quality.
I would like to explore how quality depends on pH, density, and alcohol.
I think citric acid and volatile acidity may also impact quality, so I would like to explore these as well.
Yes, I created a factor variable, qualityFactor, from quality.
Citric acid contains an outlier at 1 g/liter. Residual sugar also has a few outliers exceeding 12 g/liter. I wonder if the same wine has both outliers.
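One common convention for flagging outliers is the 1.5 x IQR rule used by box plots. As a sketch, the upper fences below are computed from the quartiles printed in the summary; `upper_fence` is a hypothetical helper, and the rule itself is an assumption here rather than something the original analysis states it used.

```r
# Upper outlier fence under the box-plot convention: Q3 + k * IQR.
upper_fence <- function(q1, q3, k = 1.5) q3 + k * (q3 - q1)

# Quartiles and maxima copied from the summary() output in this report.
fences <- data.frame(
  variable = c("citric.acid", "residual.sugar", "chlorides"),
  q1       = c(0.090, 1.900, 0.070),
  q3       = c(0.420, 2.600, 0.090),
  max      = c(1.000, 15.500, 0.611)
)

fences$upper          <- upper_fence(fences$q1, fences$q3)
fences$max_is_outlier <- fences$max > fences$upper
print(fences)

# With the data loaded as `winequality`, the rows behind the two outliers
# could be inspected with, for example:
# subset(winequality, citric.acid >= 1 | residual.sugar > 12)
```

Under this rule the maximum of each of the three variables lies above its fence, which matches the outliers visible in the plots.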
I am plotting all the variables against each other to see which are correlated and to decide which individual plots to dig into deeper.
Some of the strongest correlations are: citric acid and fixed acidity; citric acid and volatile acidity (negative); fixed acidity and density; pH and fixed acidity (negative); alcohol and density (negative).
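Reading the strongest pairs off a plot matrix can be error prone, so a ranked table is a useful complement. The sketch below defines a hypothetical helper, `top_correlations`, and runs it on R's built-in `mtcars` data purely as a stand-in, because the wine CSV is not included with this report.

```r
# Rank variable pairs by absolute Pearson correlation.
# Hypothetical helper; not part of the original analysis.
top_correlations <- function(df, n = 5) {
  num <- df[sapply(df, is.numeric)]                 # keep numeric columns only
  cm  <- cor(num)                                   # correlation matrix
  idx <- which(upper.tri(cm), arr.ind = TRUE)       # each pair exactly once
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx]
  )
  pairs <- pairs[order(-abs(pairs$r)), ]            # strongest first
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Stand-in data: mtcars ships with R. For the wine data, one would drop the
# row-index column X first, e.g. top_correlations(subset(winequality, select = -X)).
top3 <- top_correlations(mtcars, n = 3)
print(top3)
```

Sorting by absolute value keeps strong negative relationships, such as the ones listed above, next to the strong positive ones.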
Since quality is most strongly related to alcohol and volatile acidity, and since volatile acidity is related to citric acid and alcohol is related to density, we may be able to explain quality through some of these indirect variables, such as density and citric acid. Plotting these variables will help figure out what quality depends upon.
Free sulfur dioxide and total sulfur dioxide have a strong correlation. That may be because total sulfur dioxide is the sum of the free and bound forms of sulfur dioxide.
Sulphates can contribute to SO2, but they do not show any relation with free or total sulfur dioxide, which is strange. Maybe it is because we have very limited data.
Residual sugar has an extremely weak correlation with quality. It has some correlation with total sulfur dioxide and free sulfur dioxide, and its strongest correlation is with density.
Let's create some box plots to figure out the relation between some of the above variables and how they impact quality.
The above graph makes sense, since free sulfur dioxide is a part of total sulfur dioxide.
It appears that as chloride concentration increases, sulphate concentration also increases.
As we can see, citric acid increases with fixed acidity, while volatile acidity decreases as citric acid increases. Thus fixed acidity and citric acid are positively related to each other, and both are negatively correlated with volatile acidity.
As fixed acidity increases, density also increases. They have a strong correlation.
As can be seen, fixed acidity decreases as pH increases, since a higher pH means lower acidity. It is interesting that volatile acidity increases with pH while fixed acidity decreases. A higher pH should mean lower acidity, and that holds for fixed acidity but not for volatile acidity. I wonder why.
Let's see how quality is related to the various acids in the wine.
Lower quality wines tend to have higher pH levels, which means lower quality wines should have less acidity.
As expected in this dataset, higher quality wines have lower pH and lower volatile acidity (since pH and volatile acidity are, surprisingly, positively related), and higher quality wines have higher median values of citric acid and fixed acidity.
Let's see the relation between quality and alcohol.
With the exception of wines with quality 5, alcohol content appears to increase with quality. The middle quality levels also have a lot of outliers, which may be the reason for the low median at quality 5.
As the alcohol content increases, density decreases. This makes sense since the density of water is higher than that of alcohol. Since higher quality wines have higher alcohol content, density should decrease as quality increases. Let's plot density against quality to confirm the relationship.
With the exception of quality 5 wines, density appears to decrease as quality increases. Quality 5 may behave this way because of the many outliers seen above. In general, though, density decreases as quality increases.
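The physical argument about alcohol and density can be illustrated with a deliberately crude calculation. The sketch assumes wine is only water and ethanol, that their volumes add linearly, and approximate densities at 20 degrees C of 0.998 g/cm^3 for water and 0.789 g/cm^3 for ethanol. None of these simplifications come from the dataset; sugar, acids and other dissolved solids raise real wine density, which is why the measured values (0.9901 to 1.0037) sit above these estimates.

```r
# Approximate pure-liquid densities at 20 degrees C, in g/cm^3 (assumed values).
rho_water   <- 0.998
rho_ethanol <- 0.789

# Crude two-component estimate: linear blend by volume fraction.
# Ignores dissolved solids and non-ideal mixing.
mix_density <- function(abv_percent) {
  v <- abv_percent / 100
  (1 - v) * rho_water + v * rho_ethanol
}

# Minimum, median and maximum alcohol from the summary in this report.
abv <- c(min = 8.4, median = 10.2, max = 14.9)
est <- mix_density(abv)
print(round(est, 4))
```

Only the direction matters here: every extra percentage point of alcohol lowers the estimated density, consistent with the negative correlation seen in the data.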
Let's figure out the relation between sulphates and quality. As per the correlation matrix, sulphates do have some correlation with quality.
There are quite a few outliers here, so we may need to remove them to figure out the exact relationship. Looking at the above plot, it appears that higher quality wines have higher sulphate content.
Chlorides appear to have some correlation with density, pH, sulphates, and alcohol. Since some of these attributes are related to quality, I would like to see what relation chlorides have with these variables and whether chlorides help determine quality. Let's plot chlorides against these attributes.
As per the above graph, there doesn't appear to be any significant relation between quality and chlorides. Chlorides have a positive relation with sulphates, and sulphates have a positive relation with quality, but chlorides don't appear to impact quality.
As quality increases, alcohol content increases
As quality increases, citric acid increases, volatile acidity decreases and fixed acidity increases
As quality increases, density decreases
As quality increases, pH value decreases
As quality increases, sulphates content increases
Chlorides do not appear to have an impact on quality
Residual sugar does not appear to have an impact on quality
Higher alcohol content has lower density
Fixed acidity increases as density increases
Fixed acidity decreases as pH increases
Citric acid decreases as pH increases
Volatile acidity increases as pH increases
Free sulfur dioxide is positively related to total sulfur dioxide
Chlorides have a slight positive relation with sulphates
Quality is positively related to alcohol (correlation 0.476).
Free sulfur dioxide is positively related to total sulfur dioxide (correlation 0.668).
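Squaring a correlation gives the share of variance that a single-predictor linear fit would explain. The short sketch below applies this to the two correlations quoted above; the variable names are made up for illustration.

```r
# Correlations quoted in this report.
r_quality_alcohol <- 0.476
r_free_total_so2  <- 0.668

# For simple linear regression, R-squared equals the squared correlation.
r_squared <- c(
  quality_alcohol = r_quality_alcohol^2,
  free_total_so2  = r_free_total_so2^2
)
print(round(r_squared, 3))
```

The value for quality and alcohol rounds to 0.227, the same as the R-squared reported for the alcohol-only model m1 in the model table of this report, so even the strongest single relationship leaves most of the variation in quality unexplained.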
Looking at the above graph, wines with higher alcohol content and higher citric acid content seem to be of higher quality.
It appears that higher alcohol and lower volatile acidity tend to produce better quality wines.
Lower density and higher alcohol content seem to produce better quality wines.
Higher sulphates and higher alcohol content seem to produce better quality wines.
Next, let's try to figure out the impact of the different kinds of acids on quality.
Lower volatile acidity and higher citric acid produce better wine.
Not much correlation here.
Not much correlation here.
Low pH and high alcohol content tend to produce better wines.
Not much correlation here.
It looks like wines with higher alcohol content are rated better if they also have high sulphates.
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = winequality)
## m2: lm(formula = as.numeric(quality) ~ alcohol + citric.acid, data = winequality)
## m3: lm(formula = as.numeric(quality) ~ alcohol + citric.acid + sulphates,
## data = winequality)
## m4: lm(formula = as.numeric(quality) ~ alcohol + citric.acid + sulphates +
## density, data = winequality)
## m5: lm(formula = as.numeric(quality) ~ alcohol + citric.acid + sulphates +
## density + pH, data = winequality)
##
## ========================================================================================
##                        m1            m2            m3            m4            m5
## ----------------------------------------------------------------------------------------
##   (Intercept)          1.875***      1.830***      1.434***     19.459        20.500
##                       (0.175)       (0.171)       (0.176)      (12.063)      (12.040)
##   alcohol              0.361***      0.346***      0.338***      0.321***      0.337***
##                       (0.017)       (0.016)       (0.016)       (0.020)       (0.021)
##   citric.acid                        0.730***      0.513***      0.583***      0.405***
##                                     (0.090)       (0.093)       (0.104)       (0.121)
##   sulphates                                        0.814***      0.829***      0.811***
##                                                   (0.107)       (0.107)       (0.107)
##   density                                                      -17.931       -17.745
##                                                                (11.998)      (11.971)
##   pH                                                                          -0.402**
##                                                                               (0.139)
## ----------------------------------------------------------------------------------------
##   R-squared            0.227         0.257         0.284         0.285         0.288
##   adj. R-squared       0.226         0.256         0.282         0.283         0.286
##   sigma                0.710         0.696         0.684         0.684         0.682
##   F                  468.267       276.595       210.501       158.556       129.110
##   p                    0.000         0.000         0.000         0.000         0.000
##   Log-likelihood   -1721.057     -1688.711     -1659.955     -1658.835     -1654.636
##   Deviance           805.870       773.917       746.576       745.532       741.626
##   AIC               3448.114      3385.421      3329.910      3329.671      3323.272
##   BIC               3464.245      3406.930      3356.795      3361.934      3360.912
##   N                 1599          1599          1599          1599          1599
## ========================================================================================
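The relationship between volatile acidity, alcohol, and quality of wine was surprising, as they were not strongly related.
Yes, I created a linear model. Sulphates had the largest significant coefficient (0.811 in m5), but the predictors are measured on different scales, so coefficient size alone does not rank their impact; alcohol alone already gives an R-squared of 0.227 out of the 0.288 reached by the full model. pH and citric acid also contribute, while density is not significant. Overall, the models explain less than 30% of the variance in quality.
The dataset consists mostly of wines of average quality. It would have been better if the data were more varied and covered a wider range of qualities.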
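One way to compare predictors that are measured on different scales is to look at t statistics (estimate divided by standard error) rather than raw coefficients. The sketch below recomputes them from the m5 column of the model table and also shows how much R-squared each added predictor contributes. All numbers are copied from the table; the residual degrees of freedom (1599 observations minus 6 estimated coefficients) are an assumption about how the model was fitted.

```r
# Estimates and standard errors copied from the m5 column of the model table.
m5 <- data.frame(
  term     = c("alcohol", "citric.acid", "sulphates", "density", "pH"),
  estimate = c(0.337, 0.405, 0.811, -17.745, -0.402),
  se       = c(0.021, 0.121, 0.107, 11.971, 0.139)
)

# Assumed residual degrees of freedom: N minus (5 slopes + intercept).
df_resid <- 1599 - 6

m5$t <- m5$estimate / m5$se
m5$p <- 2 * pt(-abs(m5$t), df = df_resid)   # two-sided p-value
print(m5[order(-abs(m5$t)), ], digits = 3)

# R-squared of the nested models m1..m5 and the gain from each added predictor.
r2   <- c(m1 = 0.227, m2 = 0.257, m3 = 0.284, m4 = 0.285, m5 = 0.288)
gain <- diff(r2)
print(round(gain, 3))
```

By this measure alcohol has by far the largest t statistic, sulphates come second, and density does not reach significance at the 5% level. Adding density and pH raises R-squared by only 0.001 and 0.003 respectively.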
# Univariate analysis of quality
library(ggplot2)
ggplot(aes(x = quality), data = winequality) +
  # constant colours are set on the geom rather than mapped inside aes()
  geom_bar(color = "black", fill = "yellow") +
  xlab("Quality Rating of Wine") +
  ylab("Count of Wine")
The plot shows the distribution of quality in the dataset. Because most wines are rated 5 or 6, any conclusions drawn from this dataset apply mainly to “average” quality wines.
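# Relation between quality and residual sugar
ggplot(aes(y = residual.sugar, x = qualityFactor), data = winequality) +
  # color = "Red" inside aes() maps a constant and adds a stray legend,
  # so the colour is set on the geom instead
  geom_boxplot(color = "red") +
  xlab("Quality") +
  ylab("Residual Sugar (g / dm^3)")
To my surprise, residual sugar shows no clear relationship with the quality of wine.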
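# Relation between quality and sulphates
ggplot(aes(y = sulphates, x = qualityFactor), data = winequality) +
  # colour set on the geom, not mapped inside aes()
  geom_boxplot(color = "red") +
  xlab("Quality") +
  ylab("Sulphates (g / dm^3)")
Sulphate content rises with the quality rating, and sulphates are among the stronger predictors of quality in the linear model.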
The dataset contained mostly wines of average quality; few wines of extreme quality were included. As a result, it was difficult to reach satisfying conclusions during exploration, knowing that most of the data comes from wines of quality 5 and 6.
I was also surprised to see that residual sugar and sulfur did not play as big a role in defining quality as I had assumed they would.
For future explorations, I would like to get more wines with quality ratings other than 5 and 6. Also, the current dataset has just one categorical variable. Adding more categorical variables would help future exploration.